fix(kafkajs): include kafka_cluster_id in DSM backlog offset tracking #7569

Open
robcarlan-datadog wants to merge 6 commits into master from rob.carlan/DSMON-1226/kafkajs-dsm-backlog-cluster-id

Conversation

@robcarlan-datadog
Contributor

@robcarlan-datadog robcarlan-datadog commented Feb 18, 2026

Summary

  • DSM checkpoints correctly included kafka_cluster_id in edge tags, but the backlog/offset tracking (which feeds data_streams.kafka.lag_messages and data_streams.kafka.lag_seconds) did not include it
  • When the same topic exists on multiple Kafka clusters, producer offsets from different clusters were mixed into a single metric series, causing incorrect lag calculations
  • This fix threads clusterId through to setOffset() calls for both producer and consumer commit paths so that backlog entries are properly scoped per cluster
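
The effect of scoping backlog entries per cluster can be sketched as follows. This is a minimal illustration, not the dd-trace implementation: `BacklogStore`, `produceBacklog`, the `orders` topic, the cluster names, and the "keep the highest offset per tag set" rule are all assumptions made for the example. Only the `kafka_cluster_id` tag name and the `setOffset()` call come from the PR.

```javascript
'use strict'

// Hypothetical stand-in for the tracer's backlog store; not dd-trace's real API.
// Assumption: each entry keeps the highest offset seen for a given set of tags.
class BacklogStore {
  constructor () {
    this.entries = new Map()
  }

  setOffset ({ offset, ...tags }) {
    // The series key is built from every tag except the offset itself.
    const key = Object.keys(tags).sort().map(name => `${name}:${tags[name]}`).join(',')
    const previous = this.entries.get(key)
    if (previous === undefined || offset > previous) {
      this.entries.set(key, offset)
    }
  }
}

// Hypothetical helper: builds a produce backlog entry and adds
// kafka_cluster_id only when a cluster ID is known.
function produceBacklog (topic, partition, offset, clusterId) {
  const backlog = { type: 'kafka_produce', topic, partition, offset }
  if (clusterId) backlog.kafka_cluster_id = clusterId
  return backlog
}

// Without the cluster ID, the same topic on two clusters shares one series.
const unscoped = new BacklogStore()
unscoped.setOffset(produceBacklog('orders', 0, 60000))
unscoped.setOffset(produceBacklog('orders', 0, 84000))

// With the cluster ID, each cluster gets its own series.
const scoped = new BacklogStore()
scoped.setOffset(produceBacklog('orders', 0, 60000, 'cluster-a'))
scoped.setOffset(produceBacklog('orders', 0, 84000, 'cluster-b'))

console.log(unscoped.entries.size, scoped.entries.size)
```

In the unscoped store the second call overwrites the first, because both entries resolve to the same key. In the scoped store the two offsets stay separate.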

Changes

  • packages/datadog-instrumentations/src/kafkajs.js: Capture resolved clusterId in closure and include it in consumer COMMIT_OFFSETS event data
  • packages/datadog-plugin-kafkajs/src/producer.js: Extract clusterId from context and pass to transformProduceResponse, include kafka_cluster_id in backlog
  • packages/datadog-plugin-kafkajs/src/consumer.js: Extract clusterId from commit items and include kafka_cluster_id in backlog
  • packages/datadog-plugin-kafkajs/test/dsm.spec.js: Assert kafka_cluster_id is present in backlog entries when cluster ID is available

Customer impact

Customers with the same topic on multiple Kafka clusters saw data_streams.kafka.lag_messages oscillate between cluster offsets (e.g., ~60k and ~84k), producing ~23k phantom lag messages. All cluster-scoped metrics (CloudWatch, DD Agent) showed 0-1 messages lag during the same window.
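
A rough sketch of how mixed offsets turn into phantom lag is shown below. The lag model (produced offset minus committed offset) is an assumption for illustration; the PR does not describe how the backend computes `lag_messages`. The offsets are round numbers based on the ~60k and ~84k figures above, so the result only lands in the same range as the reported ~23k.

```javascript
'use strict'

// Assumed lag model (not confirmed by the PR):
// lag = newest produced offset - committed consumer offset, never negative.
function lagMessages (producedOffset, committedOffset) {
  return Math.max(0, producedOffset - committedOffset)
}

// Round, illustrative offsets; both consumers are fully caught up.
const clusterA = { produced: 60000, committed: 60000 }
const clusterB = { produced: 84000, committed: 84000 }

// Scoped per cluster: each consumer is compared with its own cluster's producer.
const scopedLagA = lagMessages(clusterA.produced, clusterA.committed)
const scopedLagB = lagMessages(clusterB.produced, clusterB.committed)

// Unscoped: cluster B's produced offset is compared with cluster A's commit.
const mixedLag = lagMessages(clusterB.produced, clusterA.committed)

console.log({ scopedLagA, scopedLagB, mixedLag })
```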

Testing

Ran DSM without the fix and verified that the lag_messages and lag_seconds metrics had no kafka_cluster_id tag:
Screenshot 2026-02-19 at 11 09 17 am

Ran DSM with the fix and verified that the kafka_cluster_id tag appeared on both metrics:
Screenshot 2026-02-19 at 11 09 59 am

🤖 Generated with Claude Code

DSM checkpoints correctly included kafka_cluster_id in edge tags, but
the backlog/offset tracking (which feeds lag metrics like
data_streams.kafka.lag_messages and data_streams.kafka.lag_seconds) did
not. This caused cross-cluster offset mixing when the same topic exists
on multiple Kafka clusters, producing wildly incorrect lag values.

Thread clusterId through to setOffset calls for both producer and
consumer commit paths so that backlog entries are scoped per cluster.

DSMON-1226

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Feb 18, 2026

Overall package size

Self size: 4.79 MB
Deduped: 5.63 MB
No deduping: 5.63 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| import-in-the-middle | 2.0.6 | 81.92 kB | 816.75 kB |
| dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

@codecov

codecov bot commented Feb 18, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.29%. Comparing base (631fb6a) to head (a304de5).
⚠️ Report is 11 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| packages/datadog-plugin-kafkajs/src/consumer.js | 0.00% | 5 Missing ⚠️ |
| packages/datadog-plugin-kafkajs/src/producer.js | 16.66% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7569      +/-   ##
==========================================
- Coverage   80.32%   80.29%   -0.03%     
==========================================
  Files         733      733              
  Lines       31546    31570      +24     
==========================================
+ Hits        25338    25349      +11     
- Misses       6208     6221      +13     
Flag Coverage Δ
aiguard-macos 38.93% <ø> (-0.10%) ⬇️
aiguard-ubuntu 39.06% <ø> (-0.10%) ⬇️
aiguard-windows 38.79% <ø> (-0.11%) ⬇️
apm-capabilities-tracing-macos 48.54% <0.00%> (-0.09%) ⬇️
apm-capabilities-tracing-ubuntu 48.62% <0.00%> (-0.04%) ⬇️
apm-capabilities-tracing-windows 48.32% <0.00%> (-0.04%) ⬇️
apm-integrations-child-process 38.51% <ø> (-0.10%) ⬇️
apm-integrations-couchbase-18 37.28% <ø> (-0.24%) ⬇️
apm-integrations-couchbase-eol 37.76% <ø> (-0.25%) ⬇️
apm-integrations-oracledb 37.73% <ø> (-0.09%) ⬇️
appsec-express 55.53% <ø> (-0.07%) ⬇️
appsec-fastify 51.84% <ø> (-0.07%) ⬇️
appsec-graphql 52.03% <ø> (-0.07%) ⬇️
appsec-kafka 44.45% <50.00%> (-0.18%) ⬇️
appsec-ldapjs 44.09% <ø> (-0.08%) ⬇️
appsec-lodash 43.78% <ø> (-0.08%) ⬇️
appsec-macos 58.61% <ø> (-0.07%) ⬇️
appsec-mongodb-core 48.84% <ø> (-0.19%) ⬇️
appsec-mongoose 49.63% <ø> (-0.08%) ⬇️
appsec-mysql 51.02% <ø> (-0.07%) ⬇️
appsec-node-serialize 43.29% <ø> (-0.08%) ⬇️
appsec-passport 47.78% <ø> (-0.08%) ⬇️
appsec-postgres 50.77% <ø> (-0.07%) ⬇️
appsec-sourcing 42.64% <ø> (-0.08%) ⬇️
appsec-template 43.46% <ø> (-0.08%) ⬇️
appsec-ubuntu 58.69% <ø> (-0.07%) ⬇️
appsec-windows 58.45% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-bluebird 32.21% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-body-parser 40.52% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-child_process 37.82% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-cookie-parser 34.25% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-express 34.58% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-express-mongo-sanitize 34.38% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-express-session 40.14% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-fs 31.81% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-generic-pool 29.76% <ø> (ø)
instrumentations-instrumentation-http 39.86% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-knex 32.21% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-mongoose 33.37% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-multer 40.26% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-mysql2 38.30% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-passport 44.09% <ø> (-0.08%) ⬇️
instrumentations-instrumentation-passport-http 43.76% <ø> (-0.08%) ⬇️
instrumentations-instrumentation-passport-local 44.31% <ø> (-0.08%) ⬇️
instrumentations-instrumentation-pg 37.72% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-promise 32.13% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-promise-js 32.14% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-q 32.18% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-url 32.10% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-when 32.15% <ø> (-0.10%) ⬇️
llmobs-ai 41.33% <ø> (-0.23%) ⬇️
llmobs-anthropic 40.33% <ø> (-0.09%) ⬇️
llmobs-bedrock 39.26% <ø> (-0.08%) ⬇️
llmobs-google-genai 39.85% <ø> (-0.09%) ⬇️
llmobs-langchain 39.43% <ø> (-0.07%) ⬇️
llmobs-openai 44.14% <ø> (-0.09%) ⬇️
llmobs-vertex-ai 40.04% <ø> (-0.11%) ⬇️
platform-core 29.71% <ø> (ø)
platform-esbuild 32.89% <ø> (ø)
platform-instrumentations-misc 40.53% <ø> (ø)
platform-shimmer 36.14% <ø> (ø)
platform-unit-guardrails 31.27% <ø> (ø)
plugins-azure-event-hubs 24.02% <ø> (ø)
plugins-azure-service-bus 23.42% <ø> (ø)
plugins-bullmq 43.78% <ø> (+0.01%) ⬆️
plugins-cassandra 37.77% <ø> (-0.09%) ⬇️
plugins-cookie 25.08% <ø> (ø)
plugins-cookie-parser 24.87% <ø> (ø)
plugins-crypto 24.72% <ø> (ø)
plugins-dd-trace-api 38.37% <ø> (-0.10%) ⬇️
plugins-express-mongo-sanitize 25.04% <ø> (ø)
plugins-express-session 24.83% <ø> (ø)
plugins-fastify 42.28% <ø> (-0.09%) ⬇️
plugins-fetch 38.32% <ø> (-0.09%) ⬇️
plugins-fs 38.61% <ø> (-0.10%) ⬇️
plugins-generic-pool 24.06% <ø> (ø)
plugins-google-cloud-pubsub 45.46% <ø> (-0.09%) ⬇️
plugins-grpc 40.97% <ø> (-0.09%) ⬇️
plugins-handlebars 25.08% <ø> (ø)
plugins-hapi 40.15% <ø> (-0.09%) ⬇️
plugins-hono 40.41% <ø> (-0.09%) ⬇️
plugins-ioredis 38.42% <ø> (-0.10%) ⬇️
plugins-knex 24.80% <ø> (ø)
plugins-ldapjs 22.61% <ø> (ø)
plugins-light-my-request 24.48% <ø> (ø)
plugins-limitd-client 32.50% <ø> (-0.10%) ⬇️
plugins-lodash 24.13% <ø> (ø)
plugins-mariadb 39.50% <ø> (-0.10%) ⬇️
plugins-memcached 38.15% <ø> (-0.10%) ⬇️
plugins-microgateway-core 39.18% <ø> (-0.09%) ⬇️
plugins-moleculer 40.53% <ø> (-0.09%) ⬇️
plugins-mongodb 39.20% <ø> (-0.17%) ⬇️
plugins-mongodb-core 39.04% <ø> (-0.10%) ⬇️
plugins-mongoose 38.86% <ø> (-0.09%) ⬇️
plugins-multer 24.83% <ø> (ø)
plugins-mysql 39.14% <ø> (-0.13%) ⬇️
plugins-mysql2 39.27% <ø> (-0.10%) ⬇️
plugins-node-serialize 25.12% <ø> (ø)
plugins-opensearch 37.60% <ø> (-0.09%) ⬇️
plugins-passport-http 24.91% <ø> (ø)
plugins-postgres 35.70% <ø> (-0.08%) ⬇️
plugins-process 24.72% <ø> (ø)
plugins-pug 25.08% <ø> (ø)
plugins-redis 38.89% <ø> (-0.10%) ⬇️
plugins-router 43.03% <ø> (-0.09%) ⬇️
plugins-sequelize 23.66% <ø> (ø)
plugins-test-and-upstream-amqp10 38.49% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-amqplib 43.90% <ø> (-0.05%) ⬇️
plugins-test-and-upstream-apollo 39.03% <ø> (-0.09%) ⬇️
plugins-test-and-upstream-avsc 38.70% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-bunyan 33.80% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-connect 40.82% <ø> (-0.09%) ⬇️
plugins-test-and-upstream-graphql 40.16% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-koa 40.39% <ø> (-0.09%) ⬇️
plugins-test-and-upstream-protobufjs 38.94% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-rhea 44.13% <ø> (-0.10%) ⬇️
plugins-undici 39.12% <ø> (-0.09%) ⬇️
plugins-url 24.72% <ø> (ø)
plugins-valkey 38.04% <ø> (-0.13%) ⬇️
plugins-vm 24.72% <ø> (ø)
plugins-winston 34.00% <ø> (-0.09%) ⬇️
plugins-ws 41.92% <ø> (-0.09%) ⬇️
profiling-macos 39.85% <ø> (-0.10%) ⬇️
profiling-ubuntu 39.98% <ø> (-0.10%) ⬇️
profiling-windows 41.20% <ø> (-0.10%) ⬇️
serverless-azure-functions-client 23.75% <ø> (ø)
serverless-azure-functions-eventhubs 23.75% <ø> (ø)
serverless-azure-functions-servicebus 23.75% <ø> (ø)

@pr-commenter

pr-commenter bot commented Feb 18, 2026

Benchmarks

Benchmark execution time: 2026-02-23 15:45:52

Comparing candidate commit a304de5 in PR branch rob.carlan/DSMON-1226/kafkajs-dsm-backlog-cluster-id with baseline commit 631fb6a in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 225 metrics, 25 unstable metrics.

@robcarlan-datadog robcarlan-datadog changed the title fix(kafkajs): include kafka_cluster_id in DSM backlog offset tracking [DSMON-1226] fix(kafkajs): include kafka_cluster_id in DSM backlog offset tracking Feb 18, 2026
robcarlan-datadog and others added 2 commits February 19, 2026 14:52
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 bot commented Feb 19, 2026

⚠️ Tests

⚠️ Warnings

🧪 1 Test failed

esbuild support for IAST cjs "before all" hook in "cjs" from cjs (Datadog)

Error: Timeout of 60000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/runner/work/dd-trace-js/dd-trace-js/integration-tests/appsec/iast-esbuild.spec.js)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7)

ℹ️ Info

❄️ No new flaky tests detected

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: a304de5

@robcarlan-datadog robcarlan-datadog marked this pull request as ready for review February 19, 2026 21:16
@robcarlan-datadog robcarlan-datadog requested review from a team as code owners February 19, 2026 21:16
johannbotha previously approved these changes Feb 19, 2026
Contributor

@johannbotha johannbotha left a comment

LGTM


consumer.run = function ({ eachMessage, eachBatch, ...runArgs }) {
  const wrapConsume = (clusterId) => {
    resolvedClusterId = clusterId
Member

This relies on an assumption of an implementation detail to work, which is that COMMIT_OFFSETS will always happen in the context of a run, synchronously, and that no 2 runs can run concurrently. Have you validated that this assumption is correct? If yes, then I would add a comment clarifying that as it's critical for this to work properly and for future readers.

Contributor Author

That's a good point. I checked that this is the case and added a comment.
There should be only one run, and COMMIT_OFFSETS only happens in the offsetManager, which is only used within the context of run.
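
The assumption discussed in this thread can be illustrated with a small sketch. It is not the dd-trace instrumentation: `instrumentConsumer`, `getClusterId`, `onCommit`, the `commit_offsets` event name, and the fake consumer are hypothetical, and the cluster ID is resolved synchronously here to keep the example short. The point is that a single closure variable is shared between `run` and the commit listener, so it is only correct if one consumer never has two concurrent runs and commits only happen inside a run.

```javascript
'use strict'

const { EventEmitter } = require('node:events')

// Hypothetical wrapper; names echo the review thread but this is not the dd-trace code.
function instrumentConsumer (consumer, getClusterId, onCommit) {
  // Shared by run() and the commit listener below. Safe only if a consumer has
  // at most one active run() and commit events are emitted within that run.
  let resolvedClusterId

  consumer.on('commit_offsets', (event) => {
    // Attach whatever cluster ID the current run resolved.
    onCommit({ ...event, clusterId: resolvedClusterId })
  })

  const originalRun = consumer.run
  consumer.run = function (...args) {
    resolvedClusterId = getClusterId()
    return originalRun.apply(this, args)
  }
}

// Fake consumer that commits one offset as part of run().
const fakeConsumer = new EventEmitter()
fakeConsumer.run = function () {
  this.emit('commit_offsets', { topic: 'orders', partition: 0, offset: '42' })
}

const commits = []
instrumentConsumer(fakeConsumer, () => 'cluster-a', (commit) => commits.push(commit))
fakeConsumer.run()

console.log(commits)
```

If a commit event could fire before any `run` call, or while a second run with a different cluster was in flight, the listener would read an undefined or stale `resolvedClusterId`. That is why the invariant is worth documenting in a code comment.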

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…f github.com:DataDog/dd-trace-js into rob.carlan/DSMON-1226/kafkajs-dsm-backlog-cluster-id
@robcarlan-datadog robcarlan-datadog force-pushed the rob.carlan/DSMON-1226/kafkajs-dsm-backlog-cluster-id branch from fa7ed0e to a304de5 Compare February 23, 2026 15:38
3 participants